Adjusted Rand Index

2025-03-31
Alias: ARI

The Rand Index (RI) measures how similar two clusterings are. It is the fraction of sample pairs that both clusterings treat consistently: the pair is either in the same cluster in both, or in different clusters in both.
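
The pair counting described above can be sketched in a few lines. This is a minimal illustration that uses only the standard library; the function name `rand_index` and the toy label lists are assumptions made for this example and are not part of scikit-learn.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two samples to form a pair")

    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]  # pair shares a cluster in A?
        same_b = labels_b[i] == labels_b[j]  # pair shares a cluster in B?
        agree += same_a == same_b            # consistent if both answers match
        total += 1
    return agree / total


# Toy example: 5 of the 6 pairs are treated consistently
print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))
```

Only pair membership is compared, so the label values themselves do not matter: `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same clustering and give an RI of 1.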

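The Adjusted Rand Index (ARI) corrects the RI for chance: it subtracts the agreement that random labelings would be expected to produce. As a result, random assignments score close to 0, identical clusterings score 1, and agreement worse than chance gives a negative value.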
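
To make the chance correction concrete, here is a hedged sketch of the usual contingency-table form of the ARI, written with the standard library only. The function name `adjusted_rand_index` is an assumption for illustration; in practice `sklearn.metrics.adjusted_rand_score` is the function to use.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """ARI = (index - expected_index) / (max_index - expected_index)."""
    n = len(labels_a)
    if n != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if n < 2:
        raise ValueError("need at least two samples to form a pair")

    # Pairs placed together by both clusterings (contingency-table cells)
    cells = Counter(zip(labels_a, labels_b))
    index = sum(comb(c, 2) for c in cells.values())

    # Pairs placed together by each clustering on its own
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2        # upper bound on the index

    if max_index == expected:
        # Degenerate case: both clusterings are one big cluster or all singletons
        return 1.0
    return (index - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same clustering, relabeled
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # partial agreement
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # worse than chance
```

The three calls illustrate the range of the score: a relabeled copy of the same clustering reaches 1, partial agreement lands between 0 and 1, and a clustering that systematically splits the true groups goes negative.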

Example

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
import matplotlib.pyplot as plt

# 1. Sample data
texts = [
    "dog barks loudly",        # Label 0 - Animal
    "cat meows at night",      # Label 0 - Animal
    "puppy plays with ball",   # Label 0 - Animal

    "car drives fast",         # Label 1 - Vehicle
    "truck carries cargo",     # Label 1 - Vehicle
    "bus transports people",   # Label 1 - Vehicle

    "pizza has cheese",        # Label 2 - Food
    "burger with lettuce",     # Label 2 - Food
    "pasta with tomato sauce"  # Label 2 - Food
]

true_labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# 2. Convert text to TF-IDF vectors
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# 3. Perform hierarchical clustering (you can vary 'ward', 'average', etc.)
Z = linkage(X, method='ward')

# 4. Optional: visualize dendrogram
plt.figure(figsize=(8, 4))
# Each leaf is labeled with its true class, so the x-axis shows labels, not indices
dendrogram(Z, labels=true_labels)
plt.title("Dendrogram")
plt.xlabel("True label")
plt.ylabel("Distance")
plt.show()

# 5. Cut the dendrogram to form 3 clusters
num_clusters = 3
# fcluster numbers clusters from 1; ARI ignores the label values themselves
predicted_labels = fcluster(Z, t=num_clusters, criterion='maxclust')

# 6. Evaluate clustering using Adjusted Rand Index
ari = adjusted_rand_score(true_labels, predicted_labels)
print("Adjusted Rand Index (ARI):", ari)